Generalisable Overview of Study Risk for Lead Investigators Needing Guidance (GOSLING): A data governance risk tool

Introduction
Digitisation of patient records, coupled with a moral imperative to use routinely collected data for research, necessitates effective data governance that both facilitates evidence-based research and minimises associated risks. The Generalisable Overview of Study Risk for Lead Investigators Needing Guidance (GOSLING) provides the first quantitative risk measure for assessing the data-related risks of clinical research projects.

Methods
GOSLING employs a self-assessment designed to standardise risk assessment, considering various domains, including data type, security measures, and public co-production. The tool categorises projects into low, medium, and high-risk tiers based on a scoring system developed with the input of patient and public members. It was validated using both real and synthesised project proposals to ensure its effectiveness at triaging the risk of requests for health data.

Results
In testing on fifteen projects, the tool effectively distinguished between low, medium, and high-risk projects, aligning with subjective expert assessments. An interactive interface and an open-access policy for the tool encourage researchers to self-evaluate and mitigate risks prior to submission for data governance review. Initial testing demonstrated its potential to streamline the review process by identifying projects that may require less scrutiny or those that pose significant risks.

Discussion
GOSLING represents the first quantitative approach to measuring study risk, answering calls for standardised risk assessments in using health data for research. Its implementation could contribute to advancing ethical data use, enhancing research transparency, and promoting public trust. Future work will focus on expanding its applicability and exploring its impact on research efficiency and data governance practices.


Introduction
The expansion of digital health records and biomedical datasets, coupled with the development of novel analytical techniques such as machine learning, presents an unprecedented opportunity for clinical research. However, digitisation of patient data also introduces complex challenges in data governance, privacy, and ethical considerations. Digital records can be easily duplicated and transmitted across platforms and borders, increasing the risk of unauthorised access and complicating the enforcement of consistent privacy standards. As the healthcare sector navigates the delicate balance between leveraging data for innovation and safeguarding patient privacy, the need for effective data governance tools has become increasingly apparent.
The use of de-identified routinely collected data in healthcare research has become increasingly significant, offering substantial benefits for public health insights, policy making, and personalised medicine development. De-identification involves removing or modifying personal information from health records to protect individuals' privacy, allowing researchers to access valuable datasets without compromising patient confidentiality. The distinction between de-identified and anonymised data is subtle: de-identification masks identifiers so that individuals cannot be identified without additional information, whereas anonymisation irreversibly detaches potentially identifiable labels from the data, so that the data cannot be linked back to an individual even with additional information.
Access to routinely collected medical data for clinical research requires a valid legal basis. In the absence of explicit consent from patients, the legal basis commonly used for health research by academic institutions is for a "public task" [1]. Special category data may be processed under the condition of "Archiving, research and statistics" [2]. This access to routinely collected medical data relies upon the fundamental assumption that the processing of personal health information in research and development will improve patient care. The NHS England Constitution promises to use de-identified routinely collected data for this purpose [3]. It is well-documented that risk aversion and complicated data governance processes in the UK may act as a barrier to research and development that relies upon secondary usage of routinely collected health data [4].
Risks associated with usage of health data for research can broadly be categorised into three main types: legal, governance, and reputational risks. Legal risk pertains to the potential for legal consequences arising from data use, such as breaches of data protection laws. Governance risk involves the potential for non-compliance with institutional policies and guidelines. Reputational risk relates to the potential damage to the institution's or researchers' reputation due to perceived mishandling or misuse of data. This can affect public trust and the willingness of individuals to participate in future research. In reality, there is considerable overlap between these, as "risky" practices will likely have legal, governance and reputational implications.
Whilst there is a wealth of literature offering core principles for the function of data access committees and overarching advice, this study presents the first open-access framework for assessing the data-related risks of a project [5,6]. Although there has been previous work seeking to audit compliance with privacy, data governance and ethical principles at an institutional level, such as through the Privacy and Ethics Impact and Performance Assessment (PEIPA), there is still no consensus method of assessing whether specific projects are suitable for accessing routinely collected health data [7]. In the United Kingdom, there have been calls for increased resources to aid with data governance approvals as the current process can be unclear, convoluted and burdensome [4]. Furthermore, data governance processes are widely heterogeneous between countries and differ between institutions, even when common principles are being followed [8].
This study combines guidance from the General Data Protection Regulation (GDPR), the Information Commissioner's Office, the General Medical Council, the Department for Digital, Culture, Media and Sport (DCMS), the Five Safes, the Health Insurance Portability and Accountability Act (HIPAA), the Health Research Authority and the National Health Service [9][10][11][12][13][14][15][16]. Experts in data governance from Cambridge University Hospitals, University Hospitals Birmingham and the Cancer Research UK Cambridge Centre also contributed to the design of the tool, alongside members of the public. The tool focusses on legal, governance and reputational risk factors, rather than ethical factors. The tool is not designed to replace data governance committee reviews, but rather to be used as a triage system to identify low and high-risk projects. Projects which are flagged by the tool as low-risk may require little scrutiny by a data access committee, whereas those which are high-risk must be interrogated further. By releasing the tool in an open-access fashion with full transparency of scoring weights for each question, researchers will be able to self-assess their project and make changes to reduce the risk score before submission. Although the tool is released with a set of pre-determined questions, model weights and score thresholds, individual institutions will be able to modify these if required.

Risk tool design and scoring
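The scoring self-assessment was designed to minimise the number of free-text questions, as it would be infeasible for these to contribute to scoring in an automated fashion, and, where possible, to force users to select an option. The questions were grouped under 11 domains: (1) project details (not used for scoring); (2) eligibility and data usage (not used for scoring); (3) types of data being requested; (4) special category data requests; (5) data sharing partners; (6) data access requirements; (7) security requirements for data storage; (8) data transfer between devices; (9) public involvement and engagement; (10) data transfer agreements; and (11) further free-text information (not used for scoring). Each question was assigned a weight which would either increase (higher risk) or decrease (lower risk) the cumulative data risk score by a set multiplier depending on the response. A starting total score of 50 was set to allow for reductions in score. Scoring weights were determined by consensus between the study group, with input from patient and public members.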
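The scoring mechanism described above can be illustrated with a minimal sketch. It is not the GOSLING implementation: the question wording, the options and every weight below are hypothetical placeholders rather than values from Table 1, and the "set multiplier" is simplified here to a signed adjustment added to the running total. Only the starting score of 50 is taken from the text.

```python
from dataclasses import dataclass

BASELINE_SCORE = 50  # starting total stated in the text


@dataclass
class Question:
    """One forced-choice question; all example weights are hypothetical."""
    domain: str
    # Maps each selectable option to a signed adjustment:
    # positive values raise the risk score, negative values lower it.
    option_weights: dict[str, int]


def risk_score(questions, responses, baseline=BASELINE_SCORE):
    """Add the adjustment for each selected option to the baseline score."""
    score = baseline
    for question in questions:
        selected = responses[question.domain]
        score += question.option_weights[selected]
    return score


# Hypothetical questions and weights, for illustration only.
QUESTIONS = [
    Question("Types of data being requested",
             {"de-identified": -5, "identifiable": 20}),
    Question("Data sharing partners",
             {"none": -5, "commercial partner": 15}),
    Question("Public involvement and engagement",
             {"yes": -10, "no": 0}),
]

HIGHER_RISK = {
    "Types of data being requested": "identifiable",
    "Data sharing partners": "commercial partner",
    "Public involvement and engagement": "no",
}
LOWER_RISK = {
    "Types of data being requested": "de-identified",
    "Data sharing partners": "none",
    "Public involvement and engagement": "yes",
}

print(risk_score(QUESTIONS, HIGHER_RISK))  # 50 + 20 + 15 + 0 = 85
print(risk_score(QUESTIONS, LOWER_RISK))   # 50 - 5 - 5 - 10 = 30
```

Starting from a non-zero baseline, as the text describes, leaves room for lower-risk responses to reduce the total as well as for higher-risk responses to raise it.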
The risk tool was tested on a sample of 15 studies, which included real and synthesised data request proposals. Real proposals were sourced from Cambridge University Hospitals, University Hospitals Birmingham and the CRUK Cambridge Centre. Including synthesised examples was particularly important to test the effectiveness of the tool at identifying very high-risk projects, which are uncommonly received in the real world. Scoring weights for individual questions were refined to ensure that the tool was able to differentiate between low, medium and high-risk proposals. Thresholds for low, medium and high-risk interpretations were agreed by consensus by the study team based on the results of the testing.
This study did not include research involving human participants, tissue, animals or plants.

Public and patient involvement and engagement
Public and patient involvement and engagement (PPIE) was facilitated by the National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (BRC). Nine public members, selected by the NIHR Cambridge BRC, anonymously provided feedback on the study, data risk tool and score weights for each question. All suggestions were implemented and communicated to the PPIE members, with no further revisions requested.

Results
Table 1 includes the questions and scoring weights in the final GOSLING model. An interactive spreadsheet, which automates scoring, is included as Supplementary File 1. A publicly available interface is accessible at https://datarisktool.shinyapps.io/RiskScore/ and the code to generate or modify this is accessible at https://github.com/anmolarora-98/datarisktool/ (both links as given in the original publication). In the UK, hospitals are often affiliated with a local university in the form of teaching hospitals, and the tool accounts for this.
The tool was piloted on 15 real and synthesised projects to review the face validity of the scoring and to ensure the questions were clear for researchers. These ranged from simple projects where data was contained within the host organisation to much riskier projects involving special category data (such as genomic or biometric data) being shared with commercial partners. Novel research methodologies such as federated learning were also included in testing. Federated learning projects were expected to fall in the 'medium risk' range: they should require discussion at a data governance meeting but are likely to be approved, provided that data does not leave the host institution and there are no concerning risk factors. A sample of results of the testing on nine diverse synthesised projects is shown in Table 2. Due to the sensitive nature of the real projects, their results are not published. Testing on real projects yielded comparable results, although there were fewer high-scoring outliers than among the synthesised projects, which were created to assess whether GOSLING could pick up particularly high- and low-risk projects.
A consensus meeting was held within the study group on 26th February 2024, during which the performance of the tool during this pilot was reviewed. The study group collectively agreed upon a relative ranking of the projects based on risk and a subjective assessment of whether each was low, medium or high risk. They were not blinded to the GOSLING score. Based on the performance on the synthesised and real projects, the following scoring thresholds were determined for interpretation of the score:
• 0 to 30: Low-risk: Very likely to be approved, unlikely to require in-depth review
• 31 to 69: Medium-risk: May require additional information or review
• Over 70: High-risk: Requires a discussion with the Research and Development team prior to submission to the Data Governance Committee

Table 1 (excerpt). Eligibility options, each answered via a "Please select" field:
(option label not recovered) The project has ethical approval and patients consent to researchers accessing their routinely collected data for the purposes of this particular project
2c. The project has a favourable opinion from an NHS REC AND a section 251 consent waiver approval from the Confidentiality Advisory Group
2d. The project involves members of the hospital clinical care team who will access identifiable information in order to deidentify the data for analysis
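The interpretation of a GOSLING score can be sketched as a simple lookup against the published thresholds (0 to 30 low, 31 to 69 medium, over 70 high). The function name is hypothetical, and two points are assumptions rather than statements from the paper: scores are treated as whole numbers, and a score of exactly 70, which the published bands do not explicitly cover, is treated as high-risk. Scores below 0 are likewise assumed to be low-risk.

```python
def risk_tier(score: int) -> str:
    """Map a whole-number risk score to a tier using the published bands.

    Assumptions (not stated in the paper): a score of exactly 70 is
    treated as high-risk, and any score below 0 is treated as low-risk.
    """
    if score <= 30:
        return "low"      # very likely to be approved, unlikely to need in-depth review
    if score <= 69:
        return "medium"   # may require additional information or review
    return "high"         # discuss with Research and Development before submission


for example in (30, 31, 69, 85):
    print(example, risk_tier(example))
```

An institution that adjusts the thresholds, as the tool permits, would only need to change the two boundary values in this function.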

Discussion
The GOSLING data risk tool provides, to our knowledge, the first quantitative risk evaluation of access to routinely collected health data, conforming to overarching standards of proportionality and data minimisation. The tool supports the data minimisation principle as expressed in Article 5(1)(c) of the GDPR and Article 4(1)(c) of Regulation (EU) 2018/1725. The data minimisation principle dictates that the use of personal data must be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed" [17]. Proportionality is a well-established principle within data governance, ensuring that the level of scrutiny that a proposal receives reflects the perceived risk [18]. In a proportionate data governance system, projects which are triaged as low-risk may be fast-tracked for approval with minimal scrutiny, whilst those which are high-risk could theoretically still be approved but must be carefully considered by a committee. As explained by McGrail et al (2015), this differs from a risk-minimisation approach, whereby data is only released if risks are essentially absent [18]. Central to the proportionate response is the understanding that data governance committees have a duty to share data appropriately as well as to restrict data appropriately, as emphasised by the seventh Caldicott principle supported by the UK Government National Data Guardian for Health and Social Care, which suggests that the duty to share information can be just as important as the duty to protect confidentiality [19,20]. Therefore, as well as being able to screen for high-risk projects, the tool should also be capable of triaging low-risk projects, so that these can benefit from proportionate review by a committee. The importance of this proportionate approach in reducing unnecessary impediments to low-risk projects has been highlighted in the Canadian Tri-Council Policy Statement on Ethical Conduct for Research Involving Humans (2018) [21]. One of the most comparable data risk assessments to the GOSLING tool is the data governance of the ScottisH Informatics Programme (SHIP), which provides a comprehensive framework to analyse the risk level of a project based on the public interest of the project, the data being requested, the researchers involved and the environment of the project. The GOSLING tool builds upon the SHIP model, providing a quantifiable summation of risk measures that can be used for initial triage [22].
There are numerous frameworks that provide high-level guidance on principles that coordinating bodies should consider when appraising data access requests; however, these usually involve a subjective review by a coordinating body on receipt of the application. The GOSLING tool adds to existing frameworks by allowing researchers to self-assess their risk and make necessary adjustments prior to submission. The tool considers legal, governance and reputational risk factors in order to generate an aggregate score (Fig 1). Questions 2 to 4 broadly consider the legality of the data access request, collating details on the legal basis to access the data. Questions 5 to 8 focus upon plans for the governance and handling of the data, and these questions may be altered to suit institutional requirements. Information collected from questions 9 to 11 may affect institutional risk, and these questions provide the opportunity for researchers to reduce their risk score, for example by incorporating public engagement. In practice, many questions can affect more than one type of 'risk'; for example, non-adherence to legal frameworks would likely breach institutional governance requirements and may carry associated reputational risk.
The need for a standardised risk-assessment tool for access to data is perhaps most eloquently described by the 'Goldacre Review', commissioned by the UK government in 2021 to assess how to improve the use of health data for research and analysis in the UK: "The research and analytical community is extremely frustrated with the current arrangements around data access. Researchers and NHS service analysts can spend months or years trying to get multiple permissions from multiple parties..." [23].
As well as providing a framework for data governance committees to review requests, the tool serves the dual purpose of allowing researchers to self-appraise the risk of their requests (Fig 2). Previous tools have focussed on highlighting factors that increase the risk and drawing attention to these. This tool purposefully includes potentially 'protective factors' that reduce the risk score by enhancing transparency. If the study has been subjected to external scrutiny, either by the public or other research committees, this increases the likelihood that research activities align with public interest and therefore reduces reputational risk. These factors include:
• Patient and public involvement
• A public protocol and plain English summary
• Co-existent approvals for the project, including ethical approvals
By including protective factors, we highlight to researchers that it may be possible to reduce the risk of a high-risk project to an acceptable level without changing the study design. By releasing the tool, with all scoring weights, we encourage full transparency with the public about how routinely collected health data is being used. The inclusion of patient and public involvement as a protective factor serves an additional purpose of encouraging researchers to engage with this activity early on in their research process, strengthening their data access application.
Patient and public members have been consulted in the design of the tool, consistent with the principle of 'participatory data governance', a modern feature of data governance committees [24,25]. Members of data governance committees have also contributed to the design of the tool as co-authors of this manuscript and have been actively involved in its testing. In general, there was agreement between data access committee members and public members on the content of the tool. Public members suggested modifications to questions, which were incorporated. These included both suggesting new sources of risk that had not previously been considered (e.g. expanding the definition of sensitive data) and identifying areas of jargon that may not be easily interpretable by a non-specialist audience.
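The way protective factors can bring a project's score down without changing its study design can be sketched as a small "what-if" search. This is an illustration under stated assumptions, not part of the released tool: the function name, the factor keys and the size of each reduction are hypothetical, and the target of 69 is simply the upper end of the medium-risk band.

```python
from itertools import combinations

# Hypothetical score reductions for the three protective factors named above.
PROTECTIVE_FACTORS = {
    "ppie": 10,                  # patient and public involvement
    "public_protocol": 5,        # public protocol and plain English summary
    "coexistent_approvals": 10,  # co-existent approvals, including ethical approval
}


def mitigation_options(score, target, factors=PROTECTIVE_FACTORS):
    """Return the smallest sets of protective factors that bring
    `score` down to `target` or below; an empty list if none can."""
    names = sorted(factors)
    for size in range(len(names) + 1):
        hits = [
            combo
            for combo in combinations(names, size)
            if score - sum(factors[name] for name in combo) <= target
        ]
        if hits:
            return hits
    return []


# A project scoring 85 needs a reduction of at least 16 to reach 69.
print(mitigation_options(85, 69))
```

With these placeholder values, no single factor is sufficient, and the only pair that reaches the target is co-existent approvals together with patient and public involvement. A researcher self-assessing with the real weights could run the same kind of check before submission.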

Limitations
This data risk tool is specifically designed to appraise studies which are requesting routinely collected healthcare data and that do not have informed consent from patients. Where informed consent is obtained from patients, the tool advises discussion with the governance team, as there may be different legal and ethical processes for this. Although the tool could still be used for these consented data access requests, it is not intended for this purpose. The tool is intentionally designed not to appraise the worthiness of a research question but rather to focus on the data governance issues. In this way, the tool is not aimed towards use by funding bodies or ethics committees, who may focus on the utility of the research. The tool would not differentiate between two projects of varying utility that use the same data in the same manner.
Although the tool incorporated guidance originating from outside the UK to inform the questions, it is UK-specific, referencing UK legislation and procedures. There is heterogeneity, even within the European Union, in the processing of health data for research, including data linkage between databases [7]. The tool could be modified to suit a different region, but this should be done in consultation with data protection experts in the region.
Further research is required to evidence what, if any, time saving is afforded to researchers and committees by using the tool. The tool has been tested and agreed upon by three institutions, but a follow-up study comparing its adoption by more institutions in the future would help to review how different institutions view data access requests and adjust scoring accordingly. The small sample size of the study is a limitation and the tool requires testing on more projects. Details of real studies that have been evaluated with the tool have not been published, as these were tested retrospectively and under the expectation that the details of the studies and authors would not be published publicly. Testing on real projects that have undergone data governance review would bias against the most high-risk projects, which are less likely to reach panel review without derisking. The synthesised project proposals helped to fill this gap and were intentionally designed to capture a variety of risk profiles, ranging from low to high risk. The tool requires testing in a real-world environment, where researchers self-score applications with the tool and completed scoring proformas are reviewed by a data access committee. Testing to date has been in a controlled setting, applied either to synthesised studies or retrospectively to existing projects; as a result, the expert review of the projects by the study group has not been blinded to the performance of the tool. Importantly, in future testing the tool should be applied to projects in a variety of settings (e.g. primary care, secondary care and within research organisations), and efforts could be made to translate it for application in other languages or geographies to assist with external validation of its content.

Future directions
This study presents, to our knowledge, the first publicly available tool for quantifying the relative risk of granting access to routinely collected health data for a project. This tool must not replace data access committees; rather, it presents a mechanism for such committees to streamline the processing of data access requests and triage those which require most scrutiny. The tool might also be used by research governance teams within hospital trusts who triage and review requests, helping with the development of any data access or internal trust committees. We hope that its use may also lead to higher-quality submissions of data access requests by allowing researchers to self-assess the risk of their project and showing them how risk mitigation measures may improve the likelihood of their access request being approved.

1. Project details (not used for scoring)
2. Eligibility and Data Usage (not used for scoring)
3. Types of data being requested
4. Special category data requests
5. Data sharing partners
6. Data access requirements
7. Security requirements for data storage
8. Data transfer between devices
9. Public involvement and engagement
10. Data transfer agreements
11. Further free-text information (not used for scoring)

• 31 to 69: Medium-risk: May require additional information or review
• Over 70: High-risk: Requires a discussion with the Research and Development team prior to submission to the Data Governance Committee

Fig 1. Data-related risk factors. High-level overview of data-related risk factors under the domains of legal, governance and reputational considerations. Note that in practice there is overlap between these three domains.
https://doi.org/10.1371/journal.pone.0309308.g001

Table 2. A sample of nine synthesised projects, ranging from low- to high-risk, subjected to the GOSLING tool.
Columns: Synthesised project proposal summary | Subjective risk estimate | Specific risk or mitigating factors | GOSLING score
https://doi.org/10.1371/journal.pone.0309308.t002